Abstract: Text classification (also called text categorization or text tagging) is a crucial and
extensively used approach in Natural Language Processing (NLP) for assigning unseen
documents to predefined categories. In this paper, we examine dataset construction and
evaluation as components of text classification. We first built a new text classification
dataset of Indian-origin scientists, collected through focused crawling and web scraping.
We then evaluated several supervised models on this dataset, including K Nearest Neighbor,
Logistic Regression, Support Vector Machine, and Random Forest, each combined with SMOTE
and k-fold cross-validation. We use the Area under the ROC Curve (AUC) to measure the
effectiveness of the chosen models. Random Forest outperformed the other supervised
models, reaching a 90% micro-average AUC. These results provide a starting point for
further research on text classification of Indian-origin scientists.


Introduction 

Text classification is an important approach in natural language processing that
assigns a set of predefined categories to open-ended text data. The task is central
to applications such as information retrieval, topic modeling, sentiment analysis,
social media analysis, intent detection, and spam detection. A huge volume of messy,
unstructured data is generated every day; text classification with machine learning
can structure this data automatically and provide insights that support informed
business decisions. In machine learning, classification means assigning a data item
to one or more defined classes. The data may come in different formats, such as text,
images, numbers, or speech. Text classification is the process of labeling text data
with one or more groups or classes.

Text classification is divided into three types depending on the number of categories
involved: binary, multiclass, and multilabel classification. With two classes, the
task is binary classification. With more than two classes, it is multiclass
classification. In multilabel classification, a document can have one or more
labels/classes attached to it. Text classification is also known as topic
classification, text categorization, or document categorization. Fig 1 shows the
steps in creating a text classification system.

The remainder of the paper is structured as follows. Section 2 presents an outline
of text classification and highlights recent work on web scraping. Section 3 covers
the design and architecture, where we outline the dataset construction process.
Section 4 explains the evaluation models. Section 5 covers the experiments and
results. Finally, Section 6 discusses the conclusion of our work and its future scope.

Figure 1. Text Classification Pipeline (stages: unstructured text data, pre-processing, feature extraction, training and use)

*Corresponding Author: shivani.gautam@chitkarauniversity.edu.in
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
Int. J. Exp. Res. Rev., Vol. 34: 72-85 (2023)


Related Work 

Text and speech are the most common forms of unstructured data. Extracting useful
information from unstructured textual data is time-consuming and difficult. Text
classification is one of the most commonly used NLP approaches and supports informed
business decisions in various fields. Its chief aim is to assign a class to unknown
text documents, mostly with the help of supervised machine learning algorithms.
Languages such as Python, Java, R, Prolog, C/C++, or MATLAB can be used for NLP
tasks. Python is one of the best choices for NLP because of its many packages, tools,
and libraries. The Natural Language Toolkit (NLTK) is a Python package that is useful
for text classification research. Table 1 compares numerous text classification
techniques in current use.


Web Scraping 

Web scraping is a process that uses bots to extract content and data from a website
in an organized manner (Glez-Peña et al., 2014). It can be manual or automatic:
manual scraping means copying the required data from a web page and pasting it into
a text file, whereas automatic scraping uses a bot to extract data from a website
and store it in a structured format (Saurkar et al., 2018). The scraper is given the
URL to be scraped, locates the specific data to extract, runs the extraction code,
and stores the extracted data in the required format.

This method of extracting data from pages can be used in various ways, for instance
price comparison, social media scraping, email gathering, job listings, and research
and development (Hillen, 2019). Web scraping may seem an easy way to extract
information from web pages, but it has its share of challenges. The first is the
protection policies of websites, and the second is structural changes to the website.
Another challenge is adapting the scraper to the entries on each individual page
(Thota et al., 2021). Extracting huge volumes of data is a further difficulty, as it
can be quite time-consuming because of IP detection. The quality of the retrieved
data can also affect the scraping process (Dallmeier, 2021). Many open-source Python
libraries are available for data extraction. Beautiful Soup is the most commonly
used; for JavaScript-driven websites, Selenium can be used, which can also automate
browser activity. Scrapy is another widely used open-source web crawling framework
written in Python. It is useful for web scraping as well as for data extraction
through APIs (Persson, 2019).


Design and Architecture (Methodology) 

A list of seed URLs is prepared using the SeoQuake plugin for Google Chrome and
supplied to the system. All URLs found on each seed page are retrieved by web
scraping with the Beautiful Soup library. Each retrieved URL is pre-processed, and a
keyword-matching search identifies the relevant ones. The relevant web pages are then
downloaded using Selenium with Python, and the relevant data is exported to a .csv
file. Depending on the data extraction technique, details of the scientists are
extracted and stored in the database. Data mining and refinement strategies are then
applied so that the final database can be created and the search interface prepared.
Fig 2 shows the basic design and architecture of the crawler used.
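
The URL retrieval and keyword-matching steps described above could be sketched roughly as follows. This is a minimal illustration, not the authors' crawler: it uses Python's standard-library `html.parser` as a stand-in for Beautiful Soup so that it runs without extra packages, the keyword set is the one listed in Stage 1 of Algorithm 1, and the names `LinkCollector`, `url_tokens`, `relevant_links`, and the `example.org` pages are hypothetical. Stop-word removal and stemming are omitted for brevity.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Keywords listed in Stage 1 of Algorithm 1.
RELEVANT_TOKENS = {"people", "staff", "directory", "contacts", "search"}


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base_url, href))


def url_tokens(url):
    """Split the path and query of a URL into lowercase word tokens."""
    parsed = urlparse(url)
    return set(re.findall(r"[a-z]+", (parsed.path + " " + parsed.query).lower()))


def relevant_links(html, base_url):
    """Return the links whose tokens contain at least one relevant keyword."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return [u for u in collector.links if url_tokens(u) & RELEVANT_TOKENS]


# Hypothetical seed page, held in memory instead of being downloaded.
sample_html = """
<a href="/people/">People</a>
<a href="/news/2023/">News</a>
<a href="https://example.org/staff-directory">Staff</a>
"""
links = relevant_links(sample_html, "https://example.org/")
print(links)
```

In a real crawl the HTML would come from the downloaded seed page, and pages that render their listings with JavaScript would need a browser-automation tool such as Selenium, as the text notes.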

Table 1. Comparative Analysis of Text Classification Techniques

Authors | Category | Strengths | Limitations
Kilimci et al., 2018 | Deep learning & Word Embedding based Ensemble Classifier. | Improves the accuracy of classification systems by using multiple base classifiers. | Na
Patel et al., 2019 | Word embeddings in keyphrase extraction. | Achieves high performance when word embeddings are associated with document-specific features; performance improvement over complex models. | In the future, integration of posterior regularization in word embedding-based CRF models can be done.
Zeng et al., 2020 | An Ensemble method used for Text Classification in Clinical Trials. | Improves precision, accuracy, and recall over baseline methods. | Na
Xu, 2018 | Use of Naive Bayes classifiers for text classification. | Shows better performance than classic Naive Bayes when Bayesian NB classifier is combined with multinomial or Gaussian event model. | Na
Kowsari et al., 2019 | A survey of Text classification algorithms. | A comparison of feature extraction and recent text classification techniques has been discussed. | It is difficult to apply document categorization methods for information retrieval.
Mironczuk et al., 2018 | Outline of the latest components of text classification. | Various approaches in ensemble learning like bagging, boosting, AdaBoost, stacked generalization, mixtures of experts, and voting methods were compared and discussed. | Na
Karthikeyan et al., 2019 | Usage of web scraping techniques for data extraction and text classification. | The proposed model gives the best results during text classification performed on the extracted data with the help of effective web scraping methodologies, with a better accuracy rate. | Web scraping is a challenge due to poor response of the server and uneven transformation of data.
Pavani et al., 2017 | A novel web crawling method for vertical search engines. | Proposes to retrieve hidden, relevant pages by merging rank and semantic similarity information. | The performance of the model can be improved further for relevant web pages by using various machine learning algorithms.
Yu et al., 2018 | A survey about algorithms utilized by focused web crawlers. | The advantages and disadvantages of the three crawling strategies are discussed. | To find a better combination of algorithms to enhance the crawling efficiency based on a higher harvest rate and lower computational costs.
Lunn et al., 2020 | Using web scraping and natural language processing to enhance pedagogical teaching. | Discusses the effective web scraping technologies for extracting huge volumes of data from pages to create datasets. | Na
Kadhim, 2019 | Survey on supervised machine learning algorithms for text classification. | Compares various supervised machine learning algorithms to organize and extract features from the text documents. | Na
Dzisevič et al., 2019 | Classification of text by utilizing various feature extraction methods. | Compares various text feature extraction methodologies on accuracy. Results show that the TF-IDF approach achieves the highest accuracy of 91% with the huge dataset. | Na
Anglin, 2019 | Creating a new framework based on a gather, narrow, extract approach by utilizing web scraping and natural language processing methods. | Describes various web-scraping and text classification techniques for documents without reformatting data. | Na
Kim et al., 2019 | Feature extraction using text mining for big data. | Proposes a method for extracting data using text mining for big data. | Na
Schedlbauer et al., 2021 | Usage of web crawling, web scraping, and text mining for medical information market surveys. | Reveals how the combination of web crawling and web scraping techniques creates a dataset faster for analysis. | Na
Londo et al., 2019 | Survey of Text Classification for News Articles. | Results show that the Support Vector Machine gives 93% accuracy as compared to other algorithms. | A solution is needed for the imbalanced dataset in the classification work.
Onan, 2018 | An ensemble approach that depends on feature engineering and language function surveys for text classification. | Proposes ensemble of Random Forest with multiple features which gives the highest average predictive performance of 94.43%. | Na
Onan, 2021 | Sentiment analysis on MOOCS is dependent on text mining and deep learning methods. | Results reveal that ensemble classifiers outperform the supervised learning models as they achieve higher accuracy in educational data mining. | Na
Stein et al., 2019 | Use of word embeddings in the analysis of hierarchical text classification. | Results indicate that the use of word embeddings is a very positive approach to hierarchical text classification. | CNN exhibited the worst effectiveness, so further investigations are required for the same.
Gupta et al., 2021 | Ensemble classification is used for web page classification. | Proposes a heterogeneous ensemble algorithm that outperforms basic models. | Web pages have information related to diverse categories, which makes it difficult to classify web pages in each category with efficiency and accuracy.
Priyadarshini, 2021 | A Semantic Model used for Legal document classification by using Ensemble Methods. | Results show that the Ensemble method gives an accuracy of 98% as compared to other conventional methods. | For future scope, dynamic and live streaming of data from the websites can also be included.
Mohammed et al., 2022 | An ensemble framework for classification of text. | Proposes an ensemble method that combines basic deep learning models. | Na
Deeksha et al., 2021 | Web Page Classification using an Ensemble approach. | Proposes an ensemble methodology combining various basic classification models for web page classification to retrieve the Indian academician's pages from university web pages abroad. | For a reduction in the training time in the future, we can explore more groups of classifiers.
Kuriyozov et al., 2023 | Uzbek language text classification dataset survey. | Reveals that Bert-based models perform better than other models and achieve the highest scores. | The model can be tested on larger datasets to further improve the performance.
Tanasescu et al., 2022 | Impact of Big Data ETL Process on Text Mining study. | Shows the effectiveness of web scraping techniques for the collection and analysis of data. | Pre-trained deep learning models can be created to improve the performance.
Landu et al., 2022 | Text classification and machine learning for online News Articles. | Proposes a model for text classification using AUC as performance metrics. Also reveals a random forest model giving the best results up to 90% using the proposed model. |
Shrivastava et al., 2023 | Usage of deep learning models for the creation of an efficient focused crawler. | Proposes an improved focused crawling approach using LSTM-CNN-based text classification model. | Na
Kaur, 2022 | Usage of web scraping in sentiment analysis for news data with the help of machine learning algorithms. | Reveals how the combination of web scraping and supervised learning techniques gives better results. | Na
Muehlethaler et al., 2021 | Usage of web crawling and web scraping techniques for collecting textile data from the web. | Shows how the combination of web scraping and focused web crawling techniques extracts large amounts of data in a small amount of time. | Na
Vueelet al., 2022 | A new text method for classification of reviews. | Proposes a classification framework based on composite variables that outperforms all the basic models. |
Bajaj et al., 2023, Bajaj et al., 2022 | Text classification and feature selection in fog cloud computing. | Proposes a text classification and feature selection approach for offering offloading solutions in fog cloud computing. |

Figure 2. Pipeline of the proposed work (stages: input seed URLs, web scraping of URLs, URL pre-processing, keyword matching to identify relevant URLs, download contents from relevant URLs, stop-word removal, stemming, structured data)

Figure 3. Overview of the final dataset retrieved (spreadsheet columns: Name, Subject, Designation, Contact no., Email)


The workflow of the suggested methodology is shown 
in Algorithm 1 given below. It is divided into 3 stages. 


Algorithm 1
Stage 1: URL Segregation
i.   The seed URL is given as input.
ii.  All the URLs present in that seed URL are retrieved using web scraping (Beautiful Soup library).
iii. For each URL in the retrieved list:
iv.      The URL is pre-processed (tokenization, stop-word removal, stemming).
v.       If a token matches one of (people, staff, directory, contacts, search) then
vi.          the URL is relevant
vii.     else the URL is irrelevant
Stage 2: Processing of Data
- All the filtered (relevant) URLs are considered seed URLs at this stage.
- For each relevant URL in the list:
    - Fill in the list of surnames or designations with the help of the Selenium tool (which automates the task of downloading web pages by filling in all the different surnames or designations).
    - For each web page downloaded, processing is performed:
        - HTML tags are removed.
        - Tokenization of words is performed (splitting a large sample of text into words).
        - Stop words are removed.
        - Stemming is performed.
        - Tagging of words is performed.
        - A list is created as the final dataset in .csv format.
    - For each fetched name in the .csv:
        - If (fetched_name or fetched_university is in the dictionary of Indian surnames and Indian universities)
            - label_data = 1
        - Else label_data = 0
- Else: irrelevant URL
Stage 3: Filtered Classifier
i.   Load the CSV file.
ii.  Convert string columns to numeric using label encoding.
iii. Split the dataset into features and labels.
iv.  Address class imbalance using SMOTE (Synthetic Minority Oversampling Technique) oversampling.
v.   Split the resampled data into training and testing sets.
vi.  Train and evaluate different classification models using 10-fold cross-validation.

DOI: https://doi.org/10.52756/ijerr.2023.v34spl.008

The final dataset obtained through the above algorithm is shown in Fig 3.
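
The dictionary-based labeling step at the end of Stage 2 could look roughly like the sketch below. It is an illustration under stated assumptions rather than the authors' implementation: the two tiny dictionaries, the `University` column, the sample rows, and the helper name `label_row` are all hypothetical, and the surname is simply taken to be the last word of the name.

```python
import csv
import io

# Hypothetical, deliberately tiny dictionaries for illustration only.
INDIAN_SURNAMES = {"sharma", "iyer"}
INDIAN_UNIVERSITIES = {"university of delhi"}


def label_row(name, university=""):
    """Return 1 if the surname or the university is in a dictionary, else 0."""
    parts = name.strip().split()
    surname = parts[-1].lower() if parts else ""
    in_surnames = surname in INDIAN_SURNAMES
    in_universities = university.strip().lower() in INDIAN_UNIVERSITIES
    return 1 if (in_surnames or in_universities) else 0


# Hypothetical scraped rows, held in memory instead of a .csv file on disk.
raw_csv = (
    "Name,University\n"
    "A. Sharma,\n"
    "J. Smith,University of Delhi\n"
    "K. Jones,Example University\n"
)
rows = list(csv.DictReader(io.StringIO(raw_csv)))
labels = [label_row(r["Name"], r["University"]) for r in rows]
print(labels)  # [1, 1, 0]
```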


Evaluation Models 

In this paper, we ran several experiments to measure the performance of individual
models on text classification of the data retrieved for Indian-origin scientists.
The following models were used.
Logistic Regression

Logistic regression is one of the most extensively used machine learning algorithms.
It estimates discrete (categorical) outcomes from a given set of independent
variables. Its output is a probability between 0 and 1, which suits tasks such as
spam filtering or fraud detection. The model assumes a linear relationship between
the input variables and the log-odds of the output.
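
As a small illustration of how a logistic regression model maps inputs to a value between 0 and 1, the sketch below applies the sigmoid function to a weighted sum of features. The weights, bias, feature values, and the 0.5 decision threshold are hypothetical and chosen only for the example; they are not fitted to the paper's dataset.

```python
import math


def predict_proba(features, weights, bias):
    """Probability of the positive class: sigmoid of a linear combination."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical model parameters and a single two-feature input.
p = predict_proba([1.0, 0.0], weights=[2.0, -1.0], bias=-0.5)
label = 1 if p >= 0.5 else 0
print(round(p, 4), label)  # 0.8176 1
```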

Decision Trees

A decision tree is a widely used machine learning algorithm for both classification
and regression. It can be represented as a binary tree. Each internal node tests an
input variable x against a split point, and each leaf holds an output value y that
is used for prediction.

Support Vector Machines 

SVM is a supervised machine learning model used for classification as well as
regression. Its main idea is to maximize the margin, that is, the distance between
the separating hyperplane and the training samples nearest to it. In its basic form
it separates exactly two classes; problems with more classes are usually handled by
combining several binary SVMs.

KNN 


KNN belongs to the family of supervised learning algorithms. It is also called a
lazy learner because no explicit training phase is needed. It classifies an object
according to the classes of its closest neighbors in the dataset, on the assumption
that the closer two objects are, the more likely they are to be similar.
Classification is done by a majority vote among the k nearest neighbors.
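
The majority-vote rule can be sketched in a few lines. This is a minimal illustration with Euclidean distance and made-up two-dimensional points; the function name `knn_predict`, the choice k = 3, and the sample data are assumptions, not details taken from the paper.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Label a query point by majority vote among its k nearest neighbours."""
    distances = sorted(
        (math.dist(p, query), label) for p, label in zip(train_points, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Hypothetical training data: two loose clusters with labels 0 and 1.
points = [(0.0, 0.0), (0.1, 0.2), (1.0, 1.0), (0.9, 1.1), (1.2, 0.8)]
point_labels = [0, 0, 1, 1, 1]

near_origin = knn_predict(points, point_labels, (0.2, 0.1))
near_one = knn_predict(points, point_labels, (1.0, 0.9))
print(near_origin, near_one)  # 0 1
```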

Random Forests 

Random forest is an ensemble machine learning algorithm used for both classification
and regression tasks. It extends the decision tree approach by combining multiple
decision trees that work together to make predictions. Each tree is trained on its
own subset of the data. The final prediction is built by combining the predictions
of all the decision trees in the forest. A larger number of trees generally makes
the predictions more stable and accurate, and averaging over many trees reduces the
overfitting that a single decision tree is prone to.


A. Addressing class imbalance using SMOTE 
(Synthetic Minority Oversampling Technique) 

An imbalanced dataset is one in which the target classes have a disproportionate
distribution of observations. In simple words, the target variable has many more
observations in one class than in the others. The problem can be addressed with an
oversampling technique called Synthetic Minority Oversampling Technique (SMOTE).
The algorithm creates new synthetic examples by interpolating between a minority
class sample and its nearby minority class neighbors. After running our dataset
through SMOTE, we obtained a larger dataset with a balanced number of samples per
class. SMOTE also mitigates the overfitting that random oversampling tends to cause,
since it does not simply duplicate existing samples.
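
The interpolation idea behind SMOTE can be illustrated with a simplified, pure-Python sketch. It is not the reference SMOTE implementation and not necessarily what the authors ran: the function name `smote_oversample`, the neighbour count k = 2, the fixed random seed, and the toy minority points are assumptions made for the example.

```python
import math
import random


def smote_oversample(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points, each on the segment between a minority
    sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(p, base),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to other
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic


# Hypothetical imbalance: 7 majority samples versus 3 minority samples.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1)]
majority_count = 7
new_points = smote_oversample(minority, majority_count - len(minority))
balanced_minority = minority + new_points
print(len(balanced_minority))  # 7
```

Because every synthetic point lies between two existing minority samples, the new data stays inside the region the minority class already occupies instead of repeating the same rows.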

B. K-Fold Cross-Validation. 

Cross-validation is commonly used in machine learning to obtain a more reliable
estimate of model performance when data is limited. If the dataset is big enough, a
single train/test split can be used. In practice, datasets are rarely large enough
for one split to be reliable. To address this, a resampling procedure called k-fold
cross-validation is used. The procedure has a single parameter, k, which defines the
number of groups (folds) the data sample is divided into. Each fold serves once as
the test set while the remaining k-1 folds are used for training. This also helps
detect overfitting, which can go unnoticed when a model is trained and assessed on
the whole dataset.
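
A minimal sketch of how k-fold cross-validation partitions sample indices is given below. The helper name `k_fold_indices` and the sample count of 25 are hypothetical; the sketch does not shuffle the data, which a practical setup would normally do first.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs; each sample is tested exactly once."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


# Ten folds over a hypothetical dataset of 25 samples.
folds = list(k_fold_indices(25, 10))
print([len(test) for _, test in folds])  # [3, 3, 3, 3, 3, 2, 2, 2, 2, 2]
```

A model would be trained on `train_idx` and scored on `test_idx` for each pair, and the ten scores averaged.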

Finally, our goal is to compare the performance of the supervised machine learning
algorithms described above. Before creating the models, we divide the dataset into
training and testing data. To do so, we call the sklearn function train_test_split
and configure it to reserve 80% of the total dataset as training data.
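
The 80/20 split itself is simple enough to sketch without any library. The snippet below is a standard-library stand-in for sklearn's train_test_split, written only to show what the split does; the helper name `split_80_20`, the seed, and the ten placeholder rows are hypothetical.

```python
import random


def split_80_20(rows, train_fraction=0.8, seed=42):
    """Shuffle a copy of the rows and cut it into training and testing parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Ten placeholder rows standing in for the labelled dataset.
data = list(range(10))
train_rows, test_rows = split_80_20(data)
print(len(train_rows), len(test_rows))  # 8 2
```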

C. Performance Metrics for Evaluation 

A confusion matrix is a tabular representation of the performance of a machine
learning model on a given collection of test data. The matrix holds four counts
produced by the model on the test data: true positives (TP), false positives (FP),
true negatives (TN), and false negatives (FN). For binary classification the matrix
is a 2x2 table. From the confusion matrix, we can derive the following metrics.

Accuracy: The accuracy metric measures the overall performance of the model. It is
the ratio of correctly classified instances to the total number of instances.

$$\text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN}$$

Precision: The precision metric measures how accurate the positive predictions are.
It is the ratio of true positive predictions to all positive predictions.

$$\text{Precision} = \frac{TP}{TP + FP}$$

Recall: The recall metric measures the effectiveness of a classification model by
calculating the share of actual positive instances that were identified correctly.
It is the ratio of true positive instances to the sum of true positive and false
negative instances.

$$\text{Recall} = \frac{TP}{TP + FN}$$

Table 2. Accuracy percentage comparison with different classifiers

Algorithm | Accuracy | Precision | Recall | F1-Score
Random Forest | 90% | 0.94 | 0.87 | 0.90
Support Vector Machine | 69.2% | 0.79 | 0.56 | 0.65
K Nearest Neighbor | 81.5% | 0.92 | 0.70 | 0.79
Logistic Regression | 60.9% | 0.64 | 0.58 | 0.61
Simple Cart | 81.3% | 0.88 | 0.74 | 0.80
Decision Tree | 81.3% | 0.88 | 0.74 | 0.80


F1-Score: The F1-score metric is used to evaluate the performance of a binary
classification model. It is calculated as the harmonic mean of recall and precision.

$$\text{F1 Score} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$
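
The four formulas above translate directly into code. The sketch below uses hypothetical confusion-matrix counts (they are not the paper's results) and then recomputes the F1-score from the precision and recall that Table 2 reports for Random Forest, as a consistency check on the harmonic-mean formula. The function names are illustrative.

```python
def classification_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts chosen only to exercise the formulas.
m = classification_metrics(tp=40, fp=10, tn=45, fn=5)
print(m["accuracy"], m["precision"])  # 0.85 0.8

# Precision 0.94 and recall 0.87 are the Random Forest values in Table 2.
rf_f1 = f1_from(0.94, 0.87)
print(round(rf_f1, 2))  # 0.9
```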

AUC-ROC curve: The AUC-ROC curve is used to visualize the performance of a
classification model. The Receiver Operating Characteristic (ROC) curve plots the
behavior of a classification model at various classification thresholds. It is drawn
between two quantities, the True Positive Rate (TPR) and the False Positive Rate
(FPR), with TPR on the Y-axis and FPR on the X-axis. The Area Under the Curve (AUC)
varies between 0 and 1. A perfect model has an AUC of 1, so a value close to 1
indicates a near-perfect measure of separability.
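
For a binary problem, AUC can also be read as the probability that a randomly chosen positive instance receives a higher score than a randomly chosen negative one. The sketch below computes it that way by comparing every positive/negative pair. The function name `roc_auc` and the four scores are hypothetical, and the pairwise loop is meant for illustration rather than for large datasets.

```python
def roc_auc(labels, scores):
    """AUC as the share of positive/negative pairs ranked correctly
    (ties count as half)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical true labels and classifier scores.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(y_true, y_score)
print(auc)  # 0.75
```

Here three of the four positive/negative pairs are ordered correctly, which gives 0.75; a classifier that ranked every positive above every negative would score 1.0.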


Experiments and Results 

In this paper, we ran several experiments to measure the performance of individual
models on the text classification task. Confusion matrices were obtained during
classification on the dataset built from the seed URL salk.edu. In this experiment,
SMOTE with ten-fold cross-validation is applied with each supervised learning
algorithm. We fitted every model on the training dataset, tuned it on the validation
dataset, and then tested its performance on the test dataset.

Here we report the output of our experiments with the models used for text
classification on the Indian-origin scientists dataset. We analyzed the models using
precision, accuracy, recall, F1-score, and AUC-ROC, as shown in Table 2. The results
show that the Random Forest model works best, with 90% accuracy. It performs better
than all the other models, and its performance is further enhanced by adding SMOTE
and k-fold cross-validation. Our analysis demonstrates the efficacy of the Random
Forest model for text classification on the Indian-origin scientists dataset and
provides a solid groundwork for additional study in this field.


A. Comparative Analysis 

The confusion matrices and AUC-ROC curves of the classification algorithms show
that Random Forest is the best supervised algorithm for this classification task.


Confusion Matrix

Confusion matrices obtained during the text classification process on the salk.edu
dataset are shown in Fig 4.


AUC ROC Curve

AUC-ROC curves obtained during the text classification process on the salk.edu
dataset are shown in Fig 5. The plots for the various algorithms make it apparent
that the AUC-ROC for Random Forest is higher than for any other classifier. Hence,
we can conclude that Random Forest works best in classifying the positive class in
the dataset.
Figure 4. Confusion matrix representation of the different classifiers
Figure 5. AUC-ROC representation of the different classifiers


Conclusion and Future Work 

In this study, we applied text classification to the retrieval of an Indian-origin
scientist dataset. Our research led to the creation of a new dataset using focused
crawling and web scraping techniques. Through web scraping, unstructured data is
retrieved and then converted into a structured format for further research. The
text classification task uses supervised machine learning algorithms, which are
trained on the prepared data. In our investigations, we measured the performance of
various supervised models. Our evaluation showed that the standard Random Forest
model with SMOTE and 10-fold cross-validation outperformed the other models and
achieved the highest F1-score of 90%. These results indicate the best-performing
approach for this text classification task. In future work, we propose to improve
the performance of the models by calibrating them on a bigger dataset and to expand
the research to more NLP approaches.